While pointwise operations treat each element of a tensor independently, reduction patterns introduce data dependencies in which multiple input elements are combined into a single output value (for example a sum, maximum, or mean). Implementing these operations efficiently requires bridging the gap between the data's logical two-dimensional structure and its linear representation in hardware memory.
1. 2D Memory Mapping
A 2D tensor is logically a grid but is physically linear in memory. Understanding row-major versus column-major layout is essential for determining whether a reduction will access consecutive memory addresses or require strided access.
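As a concrete illustration (using NumPy here, which is an assumption — the lesson itself targets Triton), the same logical grid can be stored row-major or column-major, and the array strides reveal which axis is contiguous:

```python
import numpy as np

# The same 3x4 logical grid in two physical layouts.
a_row = np.arange(12, dtype=np.float32).reshape(3, 4)  # row-major (C order)
a_col = np.asfortranarray(a_row)                       # column-major (F order)

# Strides are in bytes; float32 = 4 bytes.
print(a_row.strides)  # (16, 4): stepping along a row moves 4 bytes -> contiguous
print(a_col.strides)  # (4, 12): stepping along a row moves 12 bytes -> strided
```

A row-wise reduction over `a_row` walks adjacent addresses, while the same reduction over `a_col` must jump by a full column's worth of bytes per element.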
2. Pointwise vs. Reduction Topology
A matrix copy represents a pointwise operation with a $1:1$ input-to-output mapping. In contrast, a reduction is a many-to-one ($N:1$) operation that requires either sharing accumulated results across threads or processing sequentially within a block.
3. Dimension Collapse
A reduction is defined by the axis along which it operates. Reducing along axis 1 (across each row) versus axis 0 (down each column) fundamentally changes the memory stride pattern and the hardware cache hit rate.
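A quick sketch (again using NumPy for illustration) of how the reduction axis determines both the output shape and the stride of each accumulation step for a row-major tensor:

```python
import numpy as np

x = np.ones((1024, 512), dtype=np.float32)  # row-major by default

row_sums = x.sum(axis=1)  # collapse columns: one value per row
col_sums = x.sum(axis=0)  # collapse rows: one value per column

print(row_sums.shape)  # (1024,) - axis-1 reduction reads contiguous elements
print(col_sums.shape)  # (512,)  - axis-0 reduction jumps 512*4 = 2048 bytes per step
```

Both calls compute the same kind of aggregate, but only the axis-1 reduction streams through memory linearly; the axis-0 reduction strides across rows.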
QUESTION 1
[Short Answer] How does a matrix copy differ from a reduction?
A matrix copy is a 1:1 pointwise operation; a reduction is a many-to-one operation requiring data synchronization.
✅ Correct!
Pointwise operations (like copy) map one input to one output, whereas reductions collapse multiple inputs into a single statistic.
❌ Incorrect
Think about the mapping ratio. A copy is 1:1, but a reduction (like sum) is N:1.

QUESTION 2
Which memory layout is characterized by elements of the same row being stored in adjacent memory addresses?
Column-major
Row-major
Strided-major
Z-order curve
✅ Correct!
Row-major (C-style) layout stores $A[i][j]$ next to $A[i][j+1]$.
❌ Incorrect
In column-major (Fortran-style), elements of the same column are contiguous.

QUESTION 3
If we reduce a tensor of shape (M, N) across axis 1, what is the resulting shape?
(M, 1) or (M,)
(1, N) or (N,)
(1, 1)
(M, N)
✅ Correct!
Reducing across axis 1 collapses the columns, leaving one value per row (size M).
❌ Incorrect
Axis 1 represents the column dimension in a 2D tensor.

QUESTION 4
Why is 'Bias Addition' considered a pointwise operation compared to 'Softmax'?
Bias addition requires every element in a row to be summed first.
Each output element in a bias add depends only on its corresponding input element and a constant.
Bias addition is performed in global memory only.
Softmax does not involve any exponentiation.
✅ Correct!
Because each addition is independent of other elements in the tensor.
❌ Incorrect
Pointwise operations lack the cross-element data dependencies found in reductions.

QUESTION 5
What is the primary architectural challenge when implementing a reduction in Triton?
Writing the result back to global memory.
Communicating or 'voting' across threads to find a single value (e.g., max).
Using the address-of operator.
Handling floating point addition.
✅ Correct!
Reductions require data dependencies where threads must synchronize or share results to compute the final aggregate.
❌ Incorrect
The challenge lies in the N-to-1 dependency, not simple I/O.

Case Study: Architectural Analysis of Row-Wise Sum
Analyzing Memory vs. Compute Topology
You are tasked with optimizing a row-wise sum for a 1024x1024 matrix stored in row-major format. The kernel reads an entire row into SRAM before performing the reduction.
Q
How does the memory access pattern differ between a matrix copy and this row-wise sum?
Solution:
In a matrix copy, both the read and write operations are contiguous and $1:1$, allowing for high-throughput coalesced memory access. In a row-wise sum, the read is contiguous (loading the row), but the write is $N:1$, where 1024 elements produce only 1 output scalar, significantly changing the bandwidth-to-compute ratio.
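A minimal sketch of the one-program-per-row pattern described above, written in plain Python/NumPy as a stand-in for the Triton kernel (the function name, the `pid` loop, and the block size are illustrative assumptions):

```python
import numpy as np

def row_sum_kernel_sim(x, block_size=1024):
    """Simulate one 'program' per row: load a contiguous row, reduce to a scalar."""
    m, n = x.shape
    out = np.empty(m, dtype=x.dtype)
    for pid in range(m):                  # each program id maps to one row
        row = x[pid, :block_size]         # contiguous read (coalesced on a GPU)
        out[pid] = row.sum()              # N:1 reduction -> a single scalar write
    return out

x = np.ones((1024, 1024), dtype=np.float32)
out = row_sum_kernel_sim(x)
print(out[:3])  # each row of 1024 ones sums to 1024.0
```

The asymmetry is visible in the shapes alone: 1024×1024 elements are read, but only 1024 scalars are written, which is exactly the bandwidth-to-compute shift the solution describes.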
Q
Why is understanding row-major layout critical for this specific reduction?
Solution:
Because the reduction is row-wise, row-major layout ensures that all 1024 elements of a row are contiguous in physical RAM. If the matrix were column-major, summing a row would require strided access (jumping across memory addresses), which would significantly degrade performance due to poor cache utilization.
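The layout effect can be made concrete (NumPy assumed for illustration): for the 1024×1024 matrix in the case study, consecutive elements of the same row sit 4 bytes apart in row-major order, but 4096 bytes apart in column-major order:

```python
import numpy as np

m = np.zeros((1024, 1024), dtype=np.float32)
c_order = np.ascontiguousarray(m)   # row-major layout
f_order = np.asfortranarray(m)      # column-major layout

# Byte step between consecutive elements of the SAME row:
print(c_order.strides[1])  # 4    -> adjacent addresses, cache-friendly row sum
print(f_order.strides[1])  # 4096 -> one full column per step, poor cache reuse
```

Every element loaded during a row sum over the column-major copy lands in a different cache line, which is why the solution flags strided access as the performance hazard.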